This script implements the feature-importance pipeline for the k-prototypes clustering in a descriptive, geometry-first fashion
Re-clustering and explicit “minimal viable feature set” selection are deferred to a dedicated follow-up script (5_parsimonious_feature_selection.Rmd), where candidate subsets will be evaluated by how well they reproduce the full 17-feature solution (e.g., via an ARI stability threshold). In the current script, all importance summaries (centroid separation, MCR, and SHAP) are treated as continuous characterizations of how features drive the 2-cluster geometry. Any high-stringency “keep” sets are used for sensitivity descriptions rather than as the sole definition of “important” features
Goal: Load packages, create output paths, configure parallel workers from SLURM, initialize logging, and generate a reproducible seed list shared across array tasks. Also define IO helper(s)
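The worker and seed set-up described above can be sketched in base R as follows. This is only an illustration under stated assumptions: SLURM_CPUS_PER_TASK is a standard SLURM environment variable, but the helper names, the fallback of one worker, the master seed value, and n_seeds = 10 are hypothetical and not taken from the script.

```r
# Hypothetical helper: read the worker count from SLURM, falling back to a
# default when the variable is unset or malformed (e.g., a local test run).
get_n_workers <- function(default = 1L) {
  raw <- Sys.getenv("SLURM_CPUS_PER_TASK", unset = NA_character_)
  n <- suppressWarnings(as.integer(raw))
  if (is.na(n) || n < 1L) default else n
}

# Hypothetical helper: derive the seed list from one fixed master seed, so
# every array task that calls it with the same arguments gets the same list.
make_seed_list <- function(n_seeds, master_seed = 20240101L) {
  set.seed(master_seed)
  sample.int(.Machine$integer.max, n_seeds)
}

n_workers <- get_n_workers()
seeds <- make_seed_list(n_seeds = 10L)
```

Deriving the list from a single master seed, rather than storing it on disk, is one way to keep array tasks in sync without shared state; writing the list to a file once would work as well.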
Set-up beginning
Set-up complete
Goal: Load the cleaned baseline risk dataframe and the final fitted k-prototypes object (scaling currently = z_score, lambda from lambdaest). Compute baseline cluster assignments from the anchor model and coerce risk_dt column types to match the fitted object to ensure consistent distance calculations in downstream predictions
Loaded k-prototypes (from list): k=2, lambda=3.31568
Sanity: predict(kp0, kp0$data) agrees with kp0$cluster in 100.0% of rows.
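The type coercion and agreement check described above could look roughly like the sketch below. It assumes plain data.frames (a data.table is converted first), and the names match_col_types() and agreement_rate() are hypothetical. The actual call to predict() on the fitted k-prototypes object is not reproduced here.

```r
# Hypothetical helper: coerce each column of new_df to the class (and, for
# factors, the exact level set) of the same-named column in ref_df.
match_col_types <- function(new_df, ref_df) {
  stopifnot(all(names(ref_df) %in% names(new_df)))
  out <- as.data.frame(new_df)[, names(ref_df), drop = FALSE]
  for (nm in names(ref_df)) {
    ref_col <- ref_df[[nm]]
    col <- out[[nm]]
    if (is.factor(ref_col)) {
      # Re-level so factor codes line up with the fitted object's data.
      out[[nm]] <- factor(as.character(col), levels = levels(ref_col))
    } else if (is.numeric(ref_col)) {
      # Go through character for factors to avoid returning level codes.
      out[[nm]] <- if (is.factor(col)) as.numeric(as.character(col)) else as.numeric(col)
    }
  }
  out
}

# Share of rows on which two label vectors agree.
agreement_rate <- function(a, b) mean(a == b)

# Toy usage with made-up columns.
ref <- data.frame(age = c(1.5, 2.0), sex = factor(c("f", "m"), levels = c("f", "m")))
new <- data.frame(sex = c("m", "f"), age = c(3L, 4L), stringsAsFactors = FALSE)
fixed <- match_col_types(new, ref)
```

Matching factor levels matters because categorical distances compare values, and a level present in one table but not the other would otherwise become NA or be miscoded.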
Goal: Implement reusable helpers aimed at the following:
permute_col() - type-preserving permutation of a given variable used for cluster reassignment
feature_mcr() - permutation MCR vs a base assignment for a single feature
shadow_baseline() - global shadow baseline per seed
shadow_baseline_feat() - shadow baseline for each feature
compute_redundancy() - |rho| and Cramer’s V flags for redundancy among retained features
centroid_profiles() - descriptive numeric/categorical summaries for the centroid of each cluster
Goal: Compute and save descriptive centroid profiles (numeric means/SDs and categorical modes/proportions) for reference, then summarise geometry-based separation and add collinearity + visual diagnostics
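A descriptive centroid profile of the kind described above can be sketched in base R. The function name centroid_profiles_sketch() and its output layout are hypothetical stand-ins, not the script's centroid_profiles(). The sketch assumes a plain data.frame containing at least one numeric and one categorical column.

```r
# Hypothetical sketch: per-cluster means/SDs for numeric columns and
# modes/mode proportions for categorical columns.
centroid_profiles_sketch <- function(df, cluster) {
  stopifnot(nrow(df) == length(cluster))
  is_num <- vapply(df, is.numeric, logical(1))
  parts <- split(df, cluster)
  numeric_part <- do.call(rbind, lapply(names(parts), function(cl) {
    block <- parts[[cl]][, is_num, drop = FALSE]
    data.frame(cluster = cl, feature = names(block),
               mean = vapply(block, mean, numeric(1), na.rm = TRUE),
               sd = vapply(block, sd, numeric(1), na.rm = TRUE),
               row.names = NULL)
  }))
  categorical_part <- do.call(rbind, lapply(names(parts), function(cl) {
    block <- parts[[cl]][, !is_num, drop = FALSE]
    do.call(rbind, lapply(names(block), function(nm) {
      tab <- table(block[[nm]])
      data.frame(cluster = cl, feature = nm,
                 mode = names(tab)[which.max(tab)],   # first level on ties
                 prop = max(tab) / sum(tab))
    }))
  }))
  list(numeric = numeric_part, categorical = categorical_part)
}

# Toy usage with made-up data.
toy <- data.frame(x = c(1, 2, 3, 10, 11, 12),
                  g = factor(c("a", "a", "b", "b", "b", "b")))
prof <- centroid_profiles_sketch(toy, cluster = c(1, 1, 1, 2, 2, 2))
```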
Goal: Quantify how far apart the clusters sit along each numeric feature in the z-scored space (delta-z, Cohen’s d, and ANOVA p-values) and rank features by geometry-based separation to inform later visualization and interpretation
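For a 2-cluster solution, the three separation measures named above can be sketched as follows. The function name separation_table() and its column names are hypothetical. Delta-z is taken here as the difference in cluster means of the z-scored feature, and Cohen's d uses the pooled within-cluster SD. Both are common definitions, but they are assumptions about what the script computes.

```r
# Hypothetical sketch: rank numeric features by how far apart two clusters
# sit in z-scored space.
separation_table <- function(z_df, cluster) {
  cluster <- factor(cluster)
  stopifnot(nlevels(cluster) == 2L)
  rows <- lapply(names(z_df), function(nm) {
    x <- z_df[[nm]]
    g <- split(x, cluster)
    m <- vapply(g, mean, numeric(1))
    v <- vapply(g, var, numeric(1))
    n <- lengths(g)
    pooled_sd <- sqrt(((n[1] - 1) * v[1] + (n[2] - 1) * v[2]) / (sum(n) - 2))
    # One-way ANOVA p-value for the cluster effect on this feature.
    p <- summary(aov(x ~ cluster))[[1]][["Pr(>F)"]][1]
    data.frame(feature = nm,
               delta_z = unname(m[2] - m[1]),
               cohens_d = unname((m[2] - m[1]) / pooled_sd),
               anova_p = p)
  })
  out <- do.call(rbind, rows)
  out[order(-abs(out$delta_z)), ]
}

# Toy usage: feature "a" separates the clusters, feature "b" does not.
set.seed(1)
num_df <- data.frame(a = c(rnorm(50, 0), rnorm(50, 3)), b = rnorm(100))
z_df <- as.data.frame(scale(num_df))
sep <- separation_table(z_df, cluster = rep(1:2, each = 50))
```

Because the cluster labels were derived from these same features, the ANOVA p-values are best read as descriptive rankings rather than as valid inferential tests.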
Goal: Characterize the correlation/collinearity structure among the 17 risk features, highlighting correlated “blocks” (e.g., symptom burden) that may behave interchangeably in permutation and SHAP analyses
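The two association measures behind the collinearity view can be sketched in base R. The helper names, the use of Spearman correlation for |rho|, and the 0.7 cutoff are all assumptions made for illustration. The script's compute_redundancy() may differ on each point.

```r
# Cramer's V for two categorical vectors (no continuity correction).
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  k <- min(nrow(tab), ncol(tab))
  unname(sqrt(chi2 / (sum(tab) * (k - 1))))
}

# Hypothetical helper: list numeric feature pairs whose absolute Spearman
# correlation reaches an (assumed) cutoff.
high_cor_pairs <- function(num_df, cutoff = 0.7) {
  rho <- abs(cor(num_df, method = "spearman", use = "pairwise.complete.obs"))
  idx <- which(upper.tri(rho) & rho >= cutoff, arr.ind = TRUE)
  data.frame(f1 = rownames(rho)[idx[, 1]],
             f2 = colnames(rho)[idx[, 2]],
             abs_rho = rho[idx])
}

# Toy usage: "a" and "b" are monotonically related, "c" is not.
toy_num <- data.frame(a = 1:10, b = (1:10)^2,
                      c = c(5, 1, 9, 3, 7, 2, 10, 4, 8, 6))
pairs_found <- high_cor_pairs(toy_num)
v_perfect <- cramers_v(c("a", "a", "b", "b"), c("x", "x", "y", "y"))
```

Features flagged together this way form the correlated blocks mentioned above: permuting one member of a block can leave labels largely unchanged because its partners carry similar information.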
Goal: Provide a visual sanity check that clusters separate in 1D/2D space along the top numeric features, using univariate densities, bivariate scatterplots with centroids, and a PCA view to complement the centroid and MCR rankings
Goal: For a given seed, re-fit k-prototypes (same k, lambda, nstart), compute per-feature MCR over biter permutations, evaluate the shadow-max threshold, and emit a per-seed CSV. These values are treated as continuous descriptive measures of how sensitive the cluster labels are to perturbations of each feature, not as the sole keep/drop gate for the risk features
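The permutation step can be illustrated with a minimal base R sketch. Several things here are assumptions: MCR is read as the fraction of rows whose cluster label changes after one feature is permuted, assign_fun is a hypothetical stand-in for predicting labels from the re-fitted k-prototypes model, and the shadow-max threshold is omitted because its exact definition is not given in the text.

```r
# Hypothetical sketch of a type-preserving permutation: reindexing a column
# keeps its class and, for factors, its levels.
permute_col_sketch <- function(df, feature) {
  df[[feature]] <- df[[feature]][sample.int(nrow(df))]
  df
}

# Hypothetical sketch of permutation MCR for one feature: mean share of rows
# whose label differs from the base assignment, over biter permutations.
feature_mcr_sketch <- function(df, feature, base_assign, assign_fun, biter = 20L) {
  rates <- vapply(seq_len(biter), function(i) {
    mean(assign_fun(permute_col_sketch(df, feature)) != base_assign)
  }, numeric(1))
  mean(rates)
}

# Toy usage: labels depend only on x, so permuting noise changes nothing.
set.seed(42)
toy <- data.frame(x = rep(c(-1, 1), each = 50), noise = rnorm(100))
assign_fun <- function(d) ifelse(d$x > 0, 2L, 1L)   # stand-in assigner
base_assign <- assign_fun(toy)
mcr_x <- feature_mcr_sketch(toy, "x", base_assign, assign_fun)
mcr_noise <- feature_mcr_sketch(toy, "noise", base_assign, assign_fun)
```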
Goal: Run a single-seed task when launched as a SLURM array job, or iterate all seeds locally when testing without SLURM. Array tasks exit early after writing their outputs
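The array-versus-local dispatch could be sketched as below. SLURM_ARRAY_TASK_ID is the standard SLURM variable for array jobs. The function name select_seeds() and the assumption that the array is 1-based (for example, submitted with --array=1-N) are hypothetical.

```r
# Hypothetical helper: pick the seed for this array task, or return every
# seed when no array index is set (local testing without SLURM).
select_seeds <- function(seeds,
                         task_raw = Sys.getenv("SLURM_ARRAY_TASK_ID",
                                               unset = NA_character_)) {
  task_id <- suppressWarnings(as.integer(task_raw))
  if (is.na(task_id)) return(seeds)
  # Assumes a 1-based array index that maps directly onto the seed list.
  stopifnot(task_id >= 1L, task_id <= length(seeds))
  seeds[task_id]
}

# Toy usage with explicit task values instead of the real environment.
one_seed <- select_seeds(c(11L, 22L, 33L), task_raw = "2")
all_seeds <- select_seeds(c(11L, 22L, 33L), task_raw = NA_character_)
```

An array task would loop over the single returned seed, write its per-seed CSV, and then stop (for example with quit(save = "no")) so that the aggregation steps are not run once per task.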
Goal: Aggregate per-seed outputs to compute mean MCR, SD, and win-rate (fraction of seeds where MCR exceeds feature-specific shadow thresholds). Treat these as continuous rankings of how sensitive the clustering solution is to perturbations in each feature. A high-stringency “keep” subset (stability + FDR) is still derived but used primarily for sensitivity descriptions, not as the sole definition of “important” features
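The aggregation described above can be sketched in base R. The input column names (seed, feature, mcr, shadow_thr) and the function name aggregate_mcr() are hypothetical, and the stability + FDR “keep” rule is left out because its thresholds are not stated in the text.

```r
# Hypothetical sketch: collapse per-seed rows into mean MCR, SD, and
# win-rate (share of seeds where MCR exceeds the feature's shadow threshold).
aggregate_mcr <- function(per_seed) {
  per_seed$win <- per_seed$mcr > per_seed$shadow_thr
  parts <- split(per_seed, per_seed$feature)
  out <- data.frame(
    feature = names(parts),
    mean_mcr = vapply(parts, function(d) mean(d$mcr), numeric(1)),
    sd_mcr = vapply(parts, function(d) sd(d$mcr), numeric(1)),
    win_rate = vapply(parts, function(d) mean(d$win), numeric(1)),
    row.names = NULL
  )
  out[order(-out$mean_mcr), ]
}

# Toy usage with made-up per-seed results for two features.
per_seed <- data.frame(
  seed = rep(1:3, times = 2),
  feature = rep(c("a", "b"), each = 3),
  mcr = c(0.30, 0.20, 0.40, 0.02, 0.01, 0.03),
  shadow_thr = c(0.10, 0.10, 0.10, 0.02, 0.02, 0.02)
)
ranked <- aggregate_mcr(per_seed)
```

In this toy input, feature "a" beats its threshold in every seed, while "b" does so in only one of three, since the comparison is strict and a tie does not count as a win.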
Goal: Train a regularized XGBoost surrogate to predict cluster labels with stratified v-fold OOF evaluation (balanced accuracy and macro-F1). Fit a final model on all data to compute SHAP global importances for descriptive corroboration of which features contribute most to discriminating the 2 clusters. SHAP is used here exclusively for visualization/triangulation, not as a hard selection criterion
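The evaluation side of the surrogate can be sketched without the model itself. The base R helpers below show one way to build stratified folds and to score out-of-fold predictions with balanced accuracy and macro-F1. The function names are hypothetical, labels are assumed to be integer or character vectors, and the XGBoost fit and SHAP computation are intentionally not reproduced.

```r
# Hypothetical helper: assign fold ids separately within each class so every
# fold has roughly the same class mix.
stratified_folds <- function(y, v = 5L) {
  fold <- integer(length(y))
  for (idx in split(seq_along(y), y)) {
    fold[idx] <- sample(rep_len(seq_len(v), length(idx)))
  }
  fold
}

# Balanced accuracy: mean of per-class recall.
balanced_accuracy <- function(truth, pred) {
  classes <- sort(unique(truth))
  mean(vapply(classes, function(cl) mean(pred[truth == cl] == cl), numeric(1)))
}

# Macro-F1: unweighted mean of per-class F1 (0 when a class has no hits).
macro_f1 <- function(truth, pred) {
  classes <- sort(unique(truth))
  f1 <- vapply(classes, function(cl) {
    tp <- sum(pred == cl & truth == cl)
    fp <- sum(pred == cl & truth != cl)
    fn <- sum(pred != cl & truth == cl)
    if (tp == 0) 0 else 2 * tp / (2 * tp + fp + fn)
  }, numeric(1))
  mean(f1)
}

# Toy usage with made-up labels.
set.seed(7)
folds <- stratified_folds(rep(1:2, times = c(20, 10)), v = 5L)
truth <- c(1, 1, 1, 1, 2, 2)
pred <- c(1, 1, 1, 2, 2, 2)
ba <- balanced_accuracy(truth, pred)
f1 <- macro_f1(truth, pred)
```

Both metrics weight the two clusters equally, which keeps a surrogate that favors the larger cluster from looking better than it is.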
Goal: Record parameters (n, p, k, lambda, scaling, biter, n_seeds, shadow_B, stability_keep) and container hash to run_log.json for traceability of this run